This paper seeks to understand salaries for machine learning professionals across the world and across levels of experience, and to identify which variables are most predictive of salary. It analyzes a recent and rich data set with over sixteen thousand respondents from 171 countries and over 200 variables. The main prediction techniques are multivariable regression, via best subset selection after feature selection using LASSO, and Random Forest. Random Forest performed significantly better than linear regression, with a test set RMSE of roughly $17k versus $28k. Random Forest's ability to capitalize on highly nonlinear relationships and high-order interactions proved useful for this data set.
Another contribution of this study is its analysis of variable importance. While there are some caveats to consider when interpreting the results (elaborated upon in the body of the document), the most important variables are country, age, tenure, employer type, industry, and title. These results may not be surprising, but what is interesting is which variables are missing from the top importance ranks. For example, college major and how people were trained in machine learning (a high proportion are self-taught or learned online) are not important predictors. One hypothesis for why major and training method are not as important is that the industry is still nascent and growth is strong; this combination is driving a need for talent. Since it is well documented that there is a shortage of talent with machine learning skills, it is likely that the barriers to entry are relatively low, and people with a strong interest and a willingness to invest time to learn can still break in.
Kaggle is an online community of data scientists and machine learners, owned by Google, Inc. Kaggle allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges.
For the first time, Kaggle conducted an industry-wide survey to establish a comprehensive view of the state of data science and machine learning. The survey received over 16,000 responses, revealing a great deal about who is working with data, what is happening at the cutting edge of machine learning across industries, and how new data scientists can best break into the field.
This is a rich data set with valuable information relating to an online data science community which is new and expanding.
This survey received 16,716 usable responses from 171 countries and territories. If a country or territory received fewer than 50 respondents, it is grouped into "Other" for anonymity.
Respondents who were flagged by the survey system as "Spam", or who did not answer the question regarding their employment status, were excluded (this question was the first required question, so not answering it indicates that the respondent did not proceed past the fifth question in the survey). Most respondents were found through Kaggle channels, such as email lists, discussion forums, and social media. The survey was live from August 7th to August 25th, and respondents could complete it at any time during that window. The median response time was 16.4 minutes.
We sourced the data from data.world, an online platform hosting many data sets. There were four separate data sets originally:
- responseData: all survey responses
- questions: data relating to the questions asked and to whom they were posed (respondents saw a different set of questions based on their answers to early questions)
- exchangeRates: exchange rates as of the date of the survey
- freeformResponses.csv: free-form responses (randomised to protect the identity of the respondents, so they cannot be used for our prediction purposes, as we cannot map them to a salary)

To analyse the salaries, we first convert them to USD using the exchange rate data set. We also produce an additional data set, compData, containing only the respondents who reported salaries.
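As a sketch of the conversion step (the analysis itself is in R; the field names and exchange rates below are illustrative stand-ins):

```python
# Convert each reported salary to USD using its currency's exchange rate,
# keeping only respondents who actually reported a salary (compData analogue).
def convert_to_usd(amount, currency, rates):
    return amount * rates[currency]

rates = {"USD": 1.0, "EUR": 1.18, "INR": 0.016}   # made-up rates

respondents = [
    {"salary": 50_000.0, "currency": "EUR"},
    {"salary": None, "currency": "USD"},          # no salary reported: dropped
    {"salary": 1_200_000.0, "currency": "INR"},
]

comp_data = [
    {**r, "salary_usd": convert_to_usd(r["salary"], r["currency"], rates)}
    for r in respondents
    if r["salary"] is not None
]
```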
Important: We implicitly assume here that those reporting salaries do so at random, conditional on being employed; we return to this assumption in the conclusion.
We pose the overall question:
And we hope to answer:
We can even attempt to predict the salary value of knowing the STAT 701 materials inside out.
Where a question asks respondents to select all work tools that apply, we remove the freeform option and replace the NAs in the follow-on questions with "missing", which indicates the tool was not used at all. For example, if a respondent indicated they used only Java, it is reasonable to assume they never use Python. We further remove the "select all that apply" questions themselves, as these are difficult to use and overlap with the more granular follow-on questions.
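A sketch of this recoding (the column names here are hypothetical stand-ins for the survey's tool-frequency questions):

```python
# Replace NAs in follow-on tool-usage questions with the explicit level
# "missing", meaning the respondent never selected that tool.
def recode_tool_usage(row, tool_columns):
    return {
        col: row[col] if row.get(col) is not None else "missing"
        for col in tool_columns
    }

tools = ["WorkToolsFrequencyPython", "WorkToolsFrequencyJava"]
respondent = {"WorkToolsFrequencyJava": "Often"}  # never mentioned Python

recoded = recode_tool_usage(respondent, tools)
```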
First we report the number of NAs as a percentage of each variable to give an idea of the distribution of NAs. In order to perform regression analysis we need to manage all of these NAs.
To clean the data for analysis of salaries, once we have selected only those respondents reporting salary, we convert the salaries into USD. We then remove any columns that are all NAs, as these come from questions asked only of respondents who self-identified as unemployed or unsalaried.
We start with more than 200 variables in our data set; some of these we remove based on judgment, as we wish to reduce the predictor space and do not expect them to be useful.
These are answers to the questions:
At work, how often did you experience these barriers? (which is unlikely to be predictive of salary). We further remove the following questions, either because we do not expect them to be useful in predicting salary or because there are too many factor levels to use practically (these are really character variables and would be more relevant for NLP processing, which is time-consuming for little value):
- WorkDataSourcing
- WorkDatasets
- WorkDatasetsChallenge
- EmployerSizeChange
- EmployerMLTime
- TitleFit
- EmployerSearchMethod
- WorkDatasetSize
- WorkCodeSharing
- WorkDataStorage
- WorkDataSharing
- CodeWriter
- MLToolNextYearSelect
- LanguageRecommendationSelect
- PublicDatasetsSelect
- LearningPlatformSelect
- PastJobTitlesSelect
- MLSkillsSelect
- WorkHardwareSelect
- WorkAlgorithmsSelect
- WorkMethodsSelect
- TimeOtherSelect
- WorkChallengesSelect
- WorkChallengeFrequencyOtherSelect
- MLTechniquesSelect
- WorkToolsSelect
- exchangeRate

We also remove:
- TimeOtherSelect
- LearningCategoryOther

as each of these equals 100% minus the percentages allocated to the other categories (which by construction sum to 100%), and we wish to avoid linear dependence among our columns for model fitting.
Finally, we remove currency and local-currency salary, as we are interested in the response (our "y"), which is CompensationAmountUSD. We filter out salaries above 1 million USD and below 100 USD, as these appear to be misreported.
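The filter itself is a simple range check; a minimal sketch with illustrative values:

```python
# Keep only salaries in [100, 1,000,000] USD; values outside this range
# are treated as misreported and dropped.
def plausible_salary(usd):
    return 100 <= usd <= 1_000_000

salaries_usd = [45_000.0, 5.0, 99_999_999.0, 250_000.0]
kept = [s for s in salaries_usd if plausible_salary(s)]
```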
Post data cleaning, reduction, and refactoring, we are left with ~94% complete cases, and only one variable has more than 1% NAs. We therefore simply remove the incomplete cases, as they comprise a small part of the resulting data. In doing so, we implicitly assume the NAs occur at random, for example from a survey not being filled out correctly. We are left with a data frame of 4,180 observations and 140 variables, including the response, CompensationAmountUSD.
We split the data 80% / 20% at random before performing our EDA. The goal is to set aside a test set that validates our model and behaves as "unseen" data. We must make this split before EDA, as otherwise we would be data snooping.
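As a rough illustration of this step (the paper's analysis is in R; this is a Python sketch, and the seed is an arbitrary choice for reproducibility):

```python
import random

# Sketch of a random 80/20 train/test split (seed value is illustrative).
def train_test_split(rows, test_frac=0.2, seed=701):
    rng = random.Random(seed)      # fix the seed so the split is reproducible
    shuffled = rows[:]             # copy so the original order is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]   # (train, test)

train_rows, test_rows = train_test_split(list(range(4180)))
```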
First, we plot the response variable we are trying to model, CompensationAmountUSD.
We note that the salaries are heavily right skewed, and take the log (base 10 for ease of interpretation) to try and pull in some of the very high salaries.
The distribution is not normal, but it is much closer to normal than for the raw values, so we make the transformation, expecting a better chance of making good predictions with an approximately normally distributed response variable.
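The transformation itself is just a base-10 logarithm of each salary; a minimal sketch with illustrative values:

```python
import math

# log10-transform salaries (USD) to pull in the heavy right tail;
# a 10x salary difference becomes a gap of exactly 1 on the log scale.
salaries = [30_000.0, 60_000.0, 120_000.0, 1_000_000.0]
log_salaries = [math.log10(s) for s in salaries]
```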
We are interested in how logCompensationAmountUSD, our response, varies with other variables in our data set. First, let's have a look at salaries by geography.
More factor analysis vs. salary. We only report factor levels here that have more than 25 members. The treemap plots draw tile sizes proportional to the size of each population.
For questions of the form How useful did you find these platforms... (of which there are a few), the responses are sparsely populated variables, since each variable is populated only when a respondent actually engages with the platform.
We tally the number of respondents who use a particular work tool here. In reality, we also have information on how regularly a person uses a tool, although this is quite granular, so it remains to be seen whether it will be important in predicting salary.
Each respondent was asked to divide up how they spend their time into 5 categories.
We look at the distributions of how users report that they allocate their time.
Now we plot the time spent on each category at work vs. (log) compensation to see if there are any relationships.
We also have data on the proportion of time spent by users learning data science across various categories, whether on the job, online etc.
We look at the distributions of how users claim they learned data science.
Next we plot the time spent on each category learning vs. (log) compensation to see if there are any relationships.
There is a comparison here between experts and a "jack of all trades": the true differentiation occurs where respondents focus on only one task for the majority of their time. For example, those who spend the majority of their time finding insights earn more on average; those who spend the majority of their time visualising earn less on average.
We begin our modelling procedure by checking the variance inflation factors (VIFs) to confirm that, after our data preprocessing and EDA, we are not left with any serious multicollinearity. We confirm that the maximum VIF (adjusted for degrees of freedom using the car::vif function) is below 5, a generally accepted cutoff.
| variable | VIF |
|---|---|
| LearningCategorySelftTaught | 3.930847 |
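To make the VIF idea concrete: for a pair of predictors, the VIF is \(1/(1-r^2)\), where \(r\) is their correlation (car::vif generalizes this to a GVIF for factor variables). A stdlib-only Python sketch with made-up, deliberately collinear data:

```python
from statistics import mean

def pearson(x, y):
    # sample Pearson correlation coefficient
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.1, 1.9, 3.2, 3.8, 5.1, 6.2]   # nearly collinear with x1

r = pearson(x1, x2)
vif = 1.0 / (1.0 - r * r)             # far exceeds the cutoff of 5
```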
We will build random forest models to be compared with LASSO and neural net models.
We will begin by tuning the number of trees in our forest and the number of predictors considered at each split (mtry), such that the out-of-bag (OOB) MSE is minimized. The following graph shows the effect of the number of trees on OOB MSE, assuming the default mtry value of 12, which is approximately the square root of the number of variables.
Based on this graph, we believe the OOB MSE stabilizes at a minimum around 100 trees, so we will build all future forests using 100 trees.
Next, to decide what value of mtry will minimize the error, we have calculated how OOB MSE varies as we change mtry. The OOB MSE stabilizes at a minimum near an mtry of 50. Consequently, we will fit our random forest using an mtry of 50.
For our final model, we use 100 trees and an mtry of 50. The final model used the following cuts:
For the LASSO model, we expanded the categorical variables into binary indicator levels and allowed regsubsets to pick a subset of the original countries. This gives the model more flexibility, but it means that in comparing variable importance between Random Forest and LASSO we are not quite comparing like for like. Nonetheless, the comparison is valuable.
Consider how variable importance is calculated for LASSO. Like ordinary linear regression, LASSO seeks to minimize the sum of squared errors; however, unlike OLS, it adds an \(\ell_1\) penalty term to control dimensionality. The least important predictors shrink to zero first as the size of the penalty \(\lambda\) increases.
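Concretely, the LASSO estimates minimize the penalized criterion (this is the standard formulation, not a detail specific to this paper's implementation):

\[
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 \;+\; \lambda \sum_{j=1}^{p} |\beta_j|
\]

At \(\lambda = 0\) this reduces to OLS; as \(\lambda\) grows, coefficients shrink and the least informative ones hit exactly zero, which is what the importance ranking exploits.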
Now consider how variable importance is calculated in Random Forest. In slightly simplified terms, a variable's importance measures how much predictive accuracy would decrease if the variable were removed. Contribution to accuracy is the gold standard, especially in the context of this study, where we are looking to predict. However, it is important to note the limitations of this plot.
The first is that it only shows the contribution of each variable to prediction accuracy; it does not show how combinations of variables contribute (interaction effects, in statistical parlance). This means the individual contributions will not sum to the total accuracy. We also do not know the functional form of the relationship; in other words, we do not know how a unit change in a variable's value would affect the prediction.
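The accuracy-based importance described above can be sketched as permutation importance: perturb one predictor column and measure how much the model's error grows. This is a toy model and toy data, not the fitted forest, and a deterministic cyclic rotation stands in for a random shuffle so the result is reproducible:

```python
# Permutation-style importance: compare MSE before and after perturbing
# a single predictor column.
def mse(model, X, y):
    return sum((model(x) - t) ** 2 for x, t in zip(X, y)) / len(y)

def rotation_importance(model, X, y, col):
    values = [row[col] for row in X]
    rotated = values[1:] + values[:1]          # cyclic shift of the column
    X_perm = [list(row) for row in X]
    for row, v in zip(X_perm, rotated):
        row[col] = v
    return mse(model, X_perm, y) - mse(model, X, y)

model = lambda x: 2.0 * x[0]                   # this "model" ignores feature 1
X = [[1.0, 9.0], [2.0, 3.0], [3.0, 7.0], [4.0, 1.0], [5.0, 5.0]]
y = [2.0, 4.0, 6.0, 8.0, 10.0]

imp0 = rotation_importance(model, X, y, col=0)  # perturbing feature 0 hurts
imp1 = rotation_importance(model, X, y, col=1)  # feature 1 is irrelevant
```

As expected, the unused feature gets zero importance, while perturbing the feature the model relies on inflates the error.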
Having elaborated on the differences, we can now look at the results.
If we ignore the fact that LASSO expands the categorical variables, we can see that country, age and tenure are all important according to both models.
Random Forest allows us to see a few more variables easily, as it aggregates them more. We can see that after the three variables mentioned above, employer type, industry, and title are the most important. It is interesting that college major and how people were trained in machine learning (a high proportion are self-taught or learned online) are not nearly as important. It is likely that the combination of strong growth and machine learning being in the early stages of adoption means that talent can more or less easily transition into the industry after some self-study or online training.
| Model | Test RMSE (USD) |
|---|---|
| Random Forest | 16,700.04 |
| LASSO (1 SE) | 28,631.52 |
| LASSO (min) | 28,584.06 |
The table above shows that Random Forest was significantly superior to linear regression. This is not surprising when prediction is the priority and the data require a complicated function to relate the predictors to the response. If further algorithms were attempted, these results suggest trying those that also allow a complex function to be fitted, such as boosting.
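The metric behind the table is test-set RMSE in dollars; since the model was fit on log10 salary, predictions were presumably back-transformed with \(10^{\hat{y}}\) before scoring. A minimal sketch with made-up numbers:

```python
import math

def rmse(y_true, y_pred):
    # root mean squared error between actual and predicted salaries (USD)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Hypothetical: back-transform log10-scale predictions, then score in USD.
actual_usd = [50_000.0, 120_000.0]
pred_log10 = [4.72, 5.05]
pred_usd = [10 ** p for p in pred_log10]
test_rmse = rmse(actual_usd, pred_usd)
```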